You've met six ideas one at a time. Now watch them snap together into a single machine that reads, understands, and writes. No new magic — just the pieces you already own, assembled into a mind.
At its core, a language model does one thing: it predicts the next word. That's the whole trick. Give it “The cat sat on the…” and it returns a probability for every possible next word. Pick one, glue it on, and ask again. Do that a few hundred times and you have an essay. Everything else is plumbing to make that one guess good.
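The predict, pick, glue loop can be sketched in a few lines. This is only an illustration: `TOY_TABLE` and `next_word_probs` are hypothetical stand-ins for a trained model, and the probabilities in the table are made up.

```python
import random

# Hypothetical stand-in for a trained model. A real LLM computes these
# probabilities from the whole text; this toy only looks at the last word.
TOY_TABLE = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {".": 1.0},
}

def next_word_probs(words):
    """Return a probability for every candidate next word."""
    return TOY_TABLE.get(words[-1], {".": 1.0})

def generate(prompt, max_words=10, seed=0):
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_words):
        probs = next_word_probs(words)                      # ask for a guess
        choices = list(probs)
        weights = [probs[w] for w in choices]
        word = rng.choices(choices, weights=weights)[0]     # pick one
        words.append(word)                                  # glue it on
        if word == ".":
            break
    return " ".join(words)

print(generate("the cat sat on the"))
```

Swap the toy table for a real model and the loop itself stays the same.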
Here is the entire pipeline a word travels through. Each stage is powered by a chapter you've already finished.
A model can't read letters; it reads numbers. So first we slice text into tokens — whole words, or chunks of them — and hand each a fixed ID from a dictionary of maybe 50,000 entries. “Tokenization” itself becomes token·ization.
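To make the slicing concrete, here is a minimal sketch. It assumes a tiny hand-written vocabulary (`VOCAB` is hypothetical) and uses greedy longest-match, a simplification swapped in for the learned merge rules that real tokenizers typically use.

```python
# Hypothetical toy vocabulary. Real tokenizers learn tens of thousands
# of entries from data; these seven are made up for illustration.
VOCAB = {"token": 0, "ization": 1, "the": 2, "cat": 3, " ": 4, "s": 5, "at": 6}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match: at each position, take the longest known piece."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("tokenization"))   # [0, 1], i.e. token + ization
print(tokenize("the cat sat"))    # "sat" is not in VOCAB, so it splits into s + at
```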
An ID is just a name tag — it carries no meaning. So each token is looked up in a giant table and replaced by a vector: a list of hundreds of numbers, an arrow in a vast space of meaning. The model learns this table so that related words land near each other, and directions become concepts. That's Chapter 4, exactly: king − man + woman lands near queen.
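A toy version of the lookup table shows the arithmetic. The three-number vectors in `EMBED` are hypothetical and written by hand so the result is easy to check; a real table is learned and has hundreds of dimensions.

```python
import math

# Hypothetical hand-made embeddings. Rough axes: [royalty, maleness, femaleness].
EMBED = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """How aligned two arrows are: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vec, exclude=()):
    """Return the word whose arrow points most nearly the same way as vec."""
    candidates = [w for w in EMBED if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, EMBED[w]))

# king - man + woman, computed one coordinate at a time
target = [k - m + w for k, m, w in zip(EMBED["king"], EMBED["man"], EMBED["woman"])]
print(nearest(target, exclude=("king", "man", "woman")))   # queen
```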
There's a catch: the model looks at all words at once and has no built-in sense of order. “Dog bites man” and “man bites dog” would look identical. The fix is beautiful — add a blend of sine and cosine waves of different frequencies to each position, a unique fingerprint of “where am I in the sentence.” The waves you spun in Chapter 1 are how an LLM knows word order.
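Here is a minimal sketch of that fingerprint. It assumes the sinusoidal scheme from the original transformer design, including its base of 10000, which this chapter does not specify. The one-hot word vectors are hypothetical.

```python
import math

def positional_encoding(position, d_model):
    """Sine/cosine pairs at geometrically spaced frequencies for one position."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))   # assumed base: 10000
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe

def stamp(embeddings):
    """Add each position's fingerprint to the word vector sitting there."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# Hypothetical toy embeddings
dog, bites, man = [1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]

a = stamp([dog, bites, man])   # "dog bites man"
b = stamp([man, bites, dog])   # "man bites dog"
print(a[0] != b[2])            # True: "dog" at position 0 differs from "dog" at position 2
```

Without the stamp, both sentences would hand the model the same three vectors.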
This is the heart of the transformer, and it's just two ideas you already have, holding hands. To understand a word, the model asks: which other words should I pay attention to? It scores every pair with a dot product (how aligned are their arrows?), shrinks the scores a little (dividing by √dₖ) so they don't blow up, then runs them through softmax so they become percentages of attention. Each word becomes a blend of the others' value vectors, weighted by those percentages.
Real models run this with hundreds of dimensions and many “heads” at once, but the recipe is exactly this: score by scaled dot product, normalize by softmax, blend the value vectors.
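The three steps (score, normalize, blend) fit in a short sketch. One assumption to flag: in a real model the queries, keys, and values come from learned projections of each word's vector, while here the same toy vectors play all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # 1. score every pair with a dot product, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # 2. normalize the scores into percentages of attention
        w = softmax(scores)
        # 3. blend the value vectors using those percentages
        blended = [sum(wj * v[c] for wj, v in zip(w, values))
                   for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(blended)
    return outputs, weights

# Toy input: three "words", two dimensions each (hypothetical values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x)
print(w[0])   # where the first word looks
```

The first word attends more to itself and to the third word than to the second, because its arrow is aligned with theirs and at a right angle to the second.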
After attention mixes information between words, each word's vector is pushed through a small feed-forward network: a couple of matrix multiplications that reshape it into a richer representation. Attention + transform together make one layer. Then we do it again. And again. GPT-style models stack dozens of layers that share the same structure but each carry their own learned weights, and each one refines the meaning a little more, exactly the “chain of transformations” from Chapter 5.
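The per-word transform and the stacking can be sketched as follows. Several things here are assumptions rather than details from this chapter: the ReLU between the two multiplications, the tiny hand-picked weights, and the omission of the attention step and of the extra wiring real layers use.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def feed_forward(x, w1, w2):
    """Two matrix multiplications with a ReLU in between (assumed nonlinearity)."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]   # expand, then zero out negatives
    return matvec(w2, hidden)                       # project back down

def run_stack(x, layers):
    """Apply the layers in order; each layer has its own weights."""
    for w1, w2 in layers:
        x = feed_forward(x, w1, w2)
    return x

# Hypothetical weights for two layers (a trained model learns these)
layer_1 = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # 2 -> 3
           [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])     # 3 -> 2
layer_2 = ([[2.0, 0.0], [0.0, 2.0]],               # 2 -> 2
           [[1.0, 1.0], [1.0, -1.0]])              # 2 -> 2

print(run_stack([1.0, -1.0], [layer_1, layer_2]))
```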
After the last layer, the final word's vector is multiplied out into one score for every token in the vocabulary. Softmax squashes those scores into a probability distribution — and temperature decides how boldly to choose. Sample one word, append it to the sentence, and run the whole pipeline again. That loop — predict, append, repeat — is called autoregression, and it's literally the toy you played with at the end of Chapter 6.
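Temperature is easiest to see in code. In this sketch the three-word vocabulary and its scores are made up, and dividing the scores by the temperature before softmax is the common convention, assumed here.

```python
import math
import random

def softmax_with_temperature(scores, temperature=1.0):
    """Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final scores for "The cat sat on the ..."
vocab = ["mat", "sofa", "moon"]
scores = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(scores, temperature=0.5)    # cautious
normal = softmax_with_temperature(scores, temperature=1.0)
hot = softmax_with_temperature(scores, temperature=2.0)     # bold

rng = random.Random(0)
word = rng.choices(vocab, weights=hot)[0]   # sample one word from the distribution
print(cold, normal, hot, word)
```

At low temperature nearly all the probability piles onto “mat”. At high temperature “moon” gets a real chance.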
A fresh model is random: the embedding table, the attention weights, every matrix is noise. We fix that by showing it oceans of real text with the next word hidden, and asking it to guess. Then we measure how surprised it was by the true answer, which is the cross-entropy loss from Chapter 6. A confident wrong guess costs a lot; a confident right guess costs almost nothing.
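For a single hidden word, that loss is just the negative log of the probability the model gave to the true answer. A minimal sketch, with made-up probabilities:

```python
import math

def cross_entropy(probs, true_index):
    """Surprise at the true answer: -log(probability assigned to it)."""
    return -math.log(probs[true_index])

# Hypothetical guess over three candidate words
probs = [0.9, 0.05, 0.05]

loss_if_right = cross_entropy(probs, true_index=0)   # about 0.105
loss_if_wrong = cross_entropy(probs, true_index=1)   # about 3.0
print(loss_if_right, loss_if_wrong)
```

The model bet 90% on the first word. If that word was the truth, the loss is tiny; if the truth was a word it gave only 5%, the loss is roughly thirty times larger.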
Then Chapter 2 takes over. We ask for the derivative of that loss with respect to every single weight (which way is downhill?) and nudge them all a hair in that direction. That's gradient descent, run with the chain rule (“backpropagation”), repeated over billions of words of text. Slowly, the noise becomes a model that knows the world.
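Here is the nudge in miniature. To keep it small, this sketch assumes the only “weights” are three raw scores, one per candidate word, and uses the standard result that the gradient of softmax plus cross-entropy with respect to the scores is the probabilities minus the one-hot truth. A real model pushes that gradient back through every layer with the chain rule.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(scores, true_index):
    probs = softmax(scores)
    loss = -math.log(probs[true_index])                  # cross-entropy
    grad = [p - (1.0 if i == true_index else 0.0)        # probs minus one-hot truth
            for i, p in enumerate(probs)]
    return loss, grad

def train(scores, true_index, learning_rate=0.5, steps=100):
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(scores, true_index)
        history.append(loss)
        # step downhill: move each weight against its gradient
        scores = [s - learning_rate * g for s, g in zip(scores, grad)]
    return scores, history

# Start with no preference among three words; the true next word is index 2
final_scores, history = train([0.0, 0.0, 0.0], true_index=2)
print(history[0], history[-1])   # the loss falls as the nudges accumulate
```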
Sine waves tell the model the order of the words.
Gradients tell every weight which way to improve.
Training is a flow downhill, the kind of motion a differential equation describes; sibling models (diffusion) solve such an equation directly.
Words become arrows; dot products drive attention.
Every layer is a matrix that transforms the meaning.
Softmax makes the guess; cross-entropy grades it.
A language model is tokens turned into arrows, stamped with waves, mixed by attention, reshaped by matrices, collapsed into a probability, and tuned by gradients. Six ideas, each of which you can now picture with your own hands.
What makes a language model work was never out of reach. It was only ever a story told in scattered pieces — and now you have all of it, end to end.